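Big data mining is well known to be an important task in data science because it can yield useful observations and new knowledge hidden in a given large dataset. Proximity-based data analysis in particular is used in many real-life applications. Such analysis usually employs the distance to the k-th nearest neighbor (k-NN distance), so its main bottleneck is data retrieval. Many efforts have been made to improve the efficiency of these analyses, but they still incur large costs because they fundamentally require many data accesses. To avoid this problem, we propose a machine-learning technique that quickly and accurately estimates the k-NN distance (i.e., the distance to the k-th nearest neighbor) of a given query. We train a fully connected neural network model and use pivots to achieve accurate estimation. Our model is designed to have useful advantages: it estimates the k-NN distances in one pass, its inference time is O(1) (no data accesses are incurred), and it retains high accuracy. Our experimental results and case studies on real datasets demonstrate the efficiency and effectiveness of our solution.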
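The quantity being estimated can be made concrete with a small sketch. The abstract does not say how the pivots enter the model, so the code below assumes (hypothetically) that distances from the query to a fixed set of pivots serve as input features, and that the brute-force k-NN distance serves as the training label. All function names are illustrative.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_knn_distance(query, data, k):
    """Brute-force ground truth: distance from the query to its k-th
    nearest neighbor. This scans the whole dataset, which is the data
    access cost a learned estimator is meant to avoid."""
    dists = sorted(euclidean(query, p) for p in data)
    return dists[k - 1]

def pivot_features(query, pivots):
    """Assumed feature map: distances from the query to fixed pivots.
    Its cost depends on the number of pivots, not on the dataset size."""
    return [euclidean(query, p) for p in pivots]

random.seed(0)
data = [(random.random(), random.random()) for _ in range(200)]
pivots = random.sample(data, 4)   # hypothetical pivot selection
query = (0.5, 0.5)

features = pivot_features(query, pivots)        # model input (assumption)
target = exact_knn_distance(query, data, k=5)   # regression label
```

A regressor trained on many such (features, target) pairs would then answer a query from the pivot distances alone, without touching the dataset at inference time.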
translated by 谷歌翻译
In various fields of data science, researchers are often interested in estimating the ratio of conditional expectation functions (CEFR). Specifically in causal inference problems, it is sometimes natural to consider ratio-based treatment effects, such as odds ratios and hazard ratios, and even difference-based treatment effects are identified as CEFR in some empirically relevant settings. This chapter develops the general framework for estimation and inference on CEFR, which allows the use of flexible machine learning for infinite-dimensional nuisance parameters. In the first stage of the framework, the orthogonal signals are constructed using debiased machine learning techniques to mitigate the negative impacts of the regularization bias in the nuisance estimates on the target estimates. The signals are then combined with a novel series estimator tailored for CEFR. We derive the pointwise and uniform asymptotic results for estimation and inference on CEFR, including the validity of the Gaussian bootstrap, and provide low-level sufficient conditions to apply the proposed framework to some specific examples. We demonstrate the finite-sample performance of the series estimator constructed under the proposed framework by numerical simulations. Finally, we apply the proposed method to estimate the causal effect of the 401(k) program on household assets.
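As a hedged illustration of the target object, a ratio of conditional expectation functions can be written as below. The symbols $Y$, $D$, $Z$, and $X$ are assumptions introduced here, not notation taken from the chapter. The second line is the familiar conditional Wald-type ratio, one standard case in which a difference-based effect takes the form of a ratio of conditional expectations.

```latex
% Generic ratio of conditional expectation functions (illustrative notation)
\theta_0(x) = \frac{\mathbb{E}[\,Y \mid X = x\,]}{\mathbb{E}[\,D \mid X = x\,]}

% Conditional Wald-type ratio with a binary instrument Z (illustrative)
\theta_0(x) = \frac{\mathbb{E}[\,Y \mid Z = 1, X = x\,] - \mathbb{E}[\,Y \mid Z = 0, X = x\,]}
                   {\mathbb{E}[\,D \mid Z = 1, X = x\,] - \mathbb{E}[\,D \mid Z = 0, X = x\,]}
```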
Background and objective: COVID-19 and its variants have caused significant disruptions in over 200 countries and regions worldwide, affecting the health and lives of billions of people. Detecting COVID-19 from chest X-ray (CXR) images has become one of the fastest and easiest detection methods because radiological pneumonia findings are common in COVID-19 patients. We present a novel high-accuracy COVID-19 detection method that uses CXR images. Methods: Our method consists of two phases. One is self-supervised learning-based pretraining; the other is batch knowledge ensembling-based fine-tuning. Self-supervised learning-based pretraining can learn discriminative representations from CXR images without manually annotated labels. Batch knowledge ensembling-based fine-tuning, in turn, can use the category knowledge of images in a batch according to their visual feature similarities to improve detection performance. Unlike our previous implementation, we introduce batch knowledge ensembling into the fine-tuning phase, reducing the memory used in self-supervised learning and improving COVID-19 detection accuracy. Results: On two public COVID-19 CXR datasets, namely a large dataset and an unbalanced dataset, our method exhibited promising COVID-19 detection performance. Our method maintains high detection accuracy even when the number of annotated CXR training images is reduced significantly (e.g., using only 10% of the original dataset). In addition, our method is insensitive to changes in hyperparameters. Conclusions: The proposed method outperforms other state-of-the-art COVID-19 detection methods in different settings. Our method can reduce the workloads of healthcare providers and radiologists.
Purpose: Given the large number of patients screened during the COVID-19 pandemic, computer-aided detection has strong potential to improve clinical workflow efficiency and reduce the incidence of infections among radiologists and healthcare providers. Since many confirmed COVID-19 cases present radiological findings of pneumonia, radiologic examinations can be useful for fast detection. Therefore, chest radiography can be used to quickly screen for COVID-19 during patient triage, thereby determining the priority of patient care and helping saturated medical facilities in a pandemic situation. Methods: In this paper, we propose a new learning scheme called self-supervised transfer learning for detecting COVID-19 from chest X-ray (CXR) images. We compared six self-supervised learning (SSL) methods (Cross, BYOL, SimSiam, SimCLR, PIRL-jigsaw, and PIRL-rotation) with the proposed method. Additionally, we compared six pretrained DCNNs (ResNet18, ResNet50, ResNet101, CheXNet, DenseNet201, and InceptionV3) with the proposed method. We provide a quantitative evaluation on the largest open COVID-19 CXR dataset and qualitative results for visual inspection. Results: Our method achieved a harmonic mean (HM) score of 0.985, an AUC of 0.999, and a four-class accuracy of 0.953. We also used the visualization technique Grad-CAM++ to generate visual explanations of different classes of CXR images with the proposed method to increase interpretability. Conclusions: Our method shows that the knowledge learned from natural images using transfer learning is beneficial for SSL of CXR images and boosts the performance of representation learning for COVID-19 detection. Our method promises to reduce the incidence of infections among radiologists and healthcare providers.
Simulating quantum channels is a fundamental primitive in quantum computing, since quantum channels define general (trace-preserving) quantum operations. An arbitrary quantum channel cannot be exactly simulated using a finite-dimensional programmable quantum processor, making it important to develop optimal approximate simulation techniques. In this paper, we study the challenging setting in which the channel to be simulated varies adversarially with time. We propose the use of matrix exponentiated gradient descent (MEGD), an online convex optimization method, and analytically show that it achieves a sublinear regret in time. Through experiments, we validate the main results for time-varying dephasing channels using a programmable generalized teleportation processor.
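For orientation, the textbook form of the matrix exponentiated gradient update and of regret is sketched below. The notation is an assumption made here: $\rho_t$ is taken to be a density matrix being optimized (for example, the processor's program state), $f_t$ is the convex loss revealed at round $t$, and $\eta > 0$ is a step size. The paper's exact loss and constants are not given in the abstract.

```latex
% Matrix exponentiated gradient update (standard form, illustrative notation)
\rho_{t+1} = \frac{\exp\!\big(\log \rho_t - \eta \, \nabla f_t(\rho_t)\big)}
                  {\operatorname{Tr}\,\exp\!\big(\log \rho_t - \eta \, \nabla f_t(\rho_t)\big)}

% Regret after T rounds against the best fixed state in hindsight
R_T = \sum_{t=1}^{T} f_t(\rho_t) - \min_{\rho} \sum_{t=1}^{T} f_t(\rho),
\qquad \text{sublinear regret: } \lim_{T \to \infty} \frac{R_T}{T} = 0
```

The trace normalization keeps every iterate positive semidefinite with unit trace, so no separate projection step is needed.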
This paper solves a generalized version of the problem of multi-source model adaptation for semantic segmentation. Model adaptation is proposed as a new domain adaptation problem which requires access to a pre-trained model instead of data for the source domain. A general multi-source setting of model adaptation assumes strictly that each source domain shares a common label space with the target domain. As a relaxation, we allow the label space of each source domain to be a subset of that of the target domain and require the union of the source-domain label spaces to be equal to the target-domain label space. For the new setting named union-set multi-source model adaptation, we propose a method with a novel learning strategy named model-invariant feature learning, which takes full advantage of the diverse characteristics of the source-domain models, thereby improving the generalization in the target domain. We conduct extensive experiments in various adaptation settings to show the superiority of our method. The code is available at https://github.com/lzy7976/union-set-model-adaptation.
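The two label-space conditions of the union-set setting can be checked mechanically. The sketch below only encodes those conditions as stated in the abstract. The function name and the example class names are hypothetical.

```python
def is_union_set_setting(source_label_spaces, target_label_space):
    """Return True when (1) every source label space is a subset of the
    target label space and (2) the union of all source label spaces
    equals the target label space."""
    target = set(target_label_space)
    sources = [set(s) for s in source_label_spaces]
    each_is_subset = all(s <= target for s in sources)
    union_covers_target = set().union(*sources) == target
    return each_is_subset and union_covers_target

# Hypothetical segmentation classes, for illustration only.
target_labels = {"road", "car", "person", "sky"}
source_a = {"road", "car"}
source_b = {"person", "sky", "road"}

valid = is_union_set_setting([source_a, source_b], target_labels)
missing_class = is_union_set_setting([source_a], target_labels)
extra_class = is_union_set_setting([source_a, source_b | {"train"}], target_labels)
```

The stricter conventional setting, in which every source shares the full target label space, is the special case where each source set equals the target set.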
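Dataset complexity assessment aims to predict classification performance on a dataset by computing its complexity before training a classifier; it can also be used for classifier selection and dataset reduction. The training process of deep convolutional neural networks (DCNNs) is iterative and time-consuming because of hyperparameter uncertainty and the domain shift introduced by different datasets. It is therefore meaningful to predict classification performance by effectively assessing the complexity of a dataset before training DCNN models. This paper proposes a novel method called cumulative maximum scaled area under Laplacian spectrum (CMSAUL), which achieves state-of-the-art complexity assessment performance on six datasets.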
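Background and objective: Sharing medical data is required to enable the cross-institutional flow of healthcare information and to construct high-accuracy computer-aided diagnosis systems. However, the large size of medical datasets, the large amount of memory needed to save deep convolutional neural network (DCNN) models, and the protection of patients' privacy are problems that can make medical data sharing inefficient. Therefore, this study proposes a novel soft-label dataset distillation method for medical data sharing. Methods: The proposed method distills the valid information of medical image data and generates several compressed images with different data distributions for anonymous medical data sharing. Furthermore, our method can extract the essential weights of DCNN models to reduce the memory required to save trained models for efficient medical data sharing. Results: The proposed method can compress tens of thousands of images into several soft-label images and reduce the size of a trained model to a few hundredths of its original size. The compressed images obtained after distillation are visually anonymized; therefore, they do not contain patients' private information. Furthermore, we can achieve high detection performance with a small number of compressed images. Conclusions: The experimental results show that the proposed method can improve the efficiency and security of medical data sharing.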
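To show what training on a soft-label image involves, the sketch below computes the standard cross-entropy between a model's predicted distribution and a soft label (a probability vector rather than a single class index). The abstract does not state the loss actually used, so this is an illustrative assumption, and the logits and labels are made-up values.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_cross_entropy(logits, soft_label):
    """Cross-entropy between a soft label (sums to 1) and the softmax
    of the logits. A one-hot soft label recovers the usual hard-label
    cross-entropy."""
    probs = softmax(logits)
    return -sum(q * math.log(p) for q, p in zip(soft_label, probs))

# Made-up example: one distilled image carrying mass on two classes.
logits = [2.0, 1.0, 0.1]
soft_label = [0.7, 0.3, 0.0]
loss = soft_label_cross_entropy(logits, soft_label)
```

Because a soft label can spread mass over several classes, a single distilled image can carry information about more than one category.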
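In many fields, obtaining advanced models depends on large datasets, which makes storing datasets and training models expensive. As a solution, dataset distillation can synthesize a small dataset such that models trained on it achieve performance on par with models trained on the original large dataset. A recently proposed dataset distillation method based on matching network parameters has proved effective on several datasets. However, some parameters are difficult to match during distillation, which harms distillation performance. Based on this observation, this paper proposes a new method that addresses the problem using parameter pruning. The proposed method can synthesize more robust distilled datasets and improve distillation performance by pruning difficult-to-match parameters during the distillation process. Experimental results on three datasets show that the proposed method outperforms other SOTA dataset distillation methods.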
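The idea of pruning hard-to-match parameters can be sketched with a normalized parameter-matching loss. The abstract does not specify the loss or the pruning criterion, so the code below makes two assumptions: the loss is the squared distance between student and expert parameters, normalized by how far the expert moved, and "difficult to match" means the largest per-parameter squared mismatch. Names and values are hypothetical.

```python
def pruned_matching_loss(student, expert_start, expert_end, prune_ratio):
    """Normalized parameter-matching loss over the parameters that
    survive pruning. The int(n * prune_ratio) parameters with the
    largest squared mismatch are excluded (assumed criterion).
    Assumes the kept expert parameters moved, so the denominator
    is nonzero."""
    n = len(student)
    errors = [(s - e) ** 2 for s, e in zip(student, expert_end)]
    n_pruned = int(n * prune_ratio)
    # Indices in ascending order of mismatch; drop the n_pruned largest.
    keep = sorted(range(n), key=lambda i: errors[i])[: n - n_pruned]
    numerator = sum(errors[i] for i in keep)
    denominator = sum((expert_start[i] - expert_end[i]) ** 2 for i in keep)
    return numerator / denominator

# Flattened toy parameter vectors; the third entry is an outlier.
student = [1.0, 2.0, 10.0, 4.0]
expert_start = [0.0, 0.0, 0.0, 0.0]
expert_end = [1.0, 2.5, 3.0, 4.5]

loss_full = pruned_matching_loss(student, expert_start, expert_end, 0.0)
loss_pruned = pruned_matching_loss(student, expert_start, expert_end, 0.25)
```

In this toy case the single outlier dominates the unpruned loss, so excluding it leaves a signal driven by the parameters that can actually be matched.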
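Sharing medical datasets between hospitals is challenging because of privacy-protection concerns and the high cost of transmitting and storing many high-resolution medical images. However, dataset distillation can synthesize a small dataset such that models trained on it achieve performance comparable to models trained on the original large dataset, which shows potential for solving existing medical data sharing problems. Hence, this paper proposes a novel dataset distillation-based method for medical dataset sharing. Experimental results on a COVID-19 chest X-ray image dataset show that our method can achieve high detection performance even when using scarce anonymized chest X-ray images.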